Deep neural networks have produced large accuracy gains in applications such as computer vision, speech recognition and natural language processing. Rapid advancements in this area have been supported by excellent libraries for developing neural networks. These libraries allow users to express a neural network as a computation graph, i.e., a sequence of mathematical operations that maps vector inputs to outputs. Given this graph, these libraries support learning network parameters from data and can optimize the computation, for example by using GPUs. The ease of developing neural networks with these libraries is a significant factor in the proliferation of these models.

Unfortunately, structured prediction problems cannot be easily expressed in neural network libraries. These problems require making a set of interrelated predictions and occur frequently in natural language processing. For example, part-of-speech tagging requires assigning a part-of-speech tag to each word in a sentence, where the tags for nearby words may depend on each other. Another example is dependency parsing, which requires predicting a directed tree spanning the words of a sentence. These problems can be posed as predicting a sequence of actions, such as shifts and reduces for parsing. Neural networks can be applied to these problems using locally-normalized models, which produce a probability distribution over each action given all prior actions. However, this approach is known to be suboptimal because it suffers from the label bias problem: the scores of future actions cannot affect the scores of previous actions. Globally-normalized models, such as conditional random fields, solve this problem by defining a joint distribution over all sequences of actions. These models express a larger class of distributions than locally-normalized models and provide better predictions. Unfortunately, they cannot be easily expressed or trained using existing neural network libraries.

Our goal is to design a neural network library that supports structured prediction with globally-normalized models. Our observation is that nondeterministic computation is a natural formalism for expressing the decision space of a structured prediction problem. For example, we can express a parser as a function that nondeterministically selects a shift or reduce action, applies it to its current state, then recurses to parse the remainder of the sentence. Our library has a nondeterministic choice operator similar to McCarthy's amb for this purpose. We combine nondeterministic choice with a computation graph, which, as demonstrated by neural network libraries, is a powerful way for users to express models. Importantly, the computation graph interacts with nondeterministic computation: the scores produced by the neural network determine the probabilities of nondeterministic choices, and the nondeterministic choices determine the network's architecture. The combination of these operations results in a powerful formalism for structured prediction where users specify the decision space using nondeterminism and the prediction models using computation graphs.

We implement our library as a Scala monad to leverage Scala's powerful type system, library support, and streamlined syntax for monadic operations. An instance Pp[X] of the monad represents a function from neural network parameters to a probability distribution over values of type X. The monad supports inference operations that take neural network parameters and return (an approximation of) this distribution. Running inference also generates a neural network and performs a forward pass in this network to compute its outputs, which may influence the probability distribution. The library also supports a variety of optimization algorithms for training parameters from data. These training algorithms use the approximate inference algorithms and also run backpropagation in the generated networks.


In [1]:
val path = "/Users/jayantk/github/pnp/"
classpath.addPath(path + "target/scala-2.11/pnp_2.11-0.1.jar")
classpath.addPath(path + "lib/jklol.jar")
classpath.add("com.google.guava" % "guava" % "17.0")


1 new artifacts in macro
1 new artifact(s)
1 new artifacts in runtime
1 new artifacts in compile
path: String = "/Users/jayantk/github/pnp/"

To use the Pp monad, we statically import its functions. Nondeterministic choices are expressed using the choose function, which has several forms. The simplest takes an explicit list of values and their corresponding probabilities:


In [1]:
import org.allenai.pnp.Pp
import org.allenai.pnp.Pp._

val flip = choose(Seq(true, false), Seq(0.75, 0.25))


This code represents a weighted coin flip that comes up true with probability 0.75 and false with probability 0.25. flip has type Pp[Boolean], representing a distribution over values of type Boolean. The Pp object represents this distribution lazily, so we need to perform inference on this object to get the probability of each value. Many different inference algorithms could be used for this purpose, such as sampling. Here we perform inference using beam search to estimate the k=10 most probable outcomes:


In [3]:
val dist = flip.beamSearch(10)


dist: Seq[(Boolean, Double)] = ArrayBuffer((true, 0.75), (false, 0.25))

We get back the original distribution, which, while expected, is not that interesting. To construct interesting models, we need to combine multiple nondeterministic choices, which we can do conveniently using Scala's for/yield construction:


In [4]:
val threeFlips = for {
    flip1 <- choose(Seq(true, false), Seq(0.75, 0.25))
    flip2 <- choose(Seq(true, false), Seq(0.75, 0.25))
    flip3 <- choose(Seq(true, false), Seq(0.75, 0.25))
} yield {
    flip1 && flip2 && flip3
}

val threeFlipsDist = threeFlips.beamSearch(10)


threeFlips: Pp[Boolean] = BindPp(CategoricalPp(List((true,-0.2876820724517809), (false,-1.3862943611198906))),<function1>)
threeFlipsDist: Seq[(Boolean, Double)] = ArrayBuffer(
  (true, 0.42187500000000006),
  (false, 0.140625),
  (false, 0.140625),
  (false, 0.140625),
  (false, 0.046875000000000014),
  (false, 0.04687499999999999),
  (false, 0.04687499999999999),
  (false, 0.015625000000000007)
)

This code flips three weighted coins and returns true if they all come up true. Inference again returns the expected distribution, which we can verify by computing 0.75 * 0.75 * 0.75 = 0.421. There are 8 entries in the returned distribution because each entry represents a way the program can execute, that is, a particular sequence of choices. Executions have additional state (specifically a computation graph) besides the return value, which is why the "duplicate" entries are not collapsed by inference. We'll see how to use the computation graph later on.

The for/yield construction above essentially means "make these choices sequentially and lazily". Scala translates this construction into calls to flatMap and map on the Pp monad, which build a lazy representation of the probability distribution.
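For illustration, here is roughly what that translation looks like when written out by hand; it relies only on the flatMap and map methods of Pp:


In [ ]:
// Hand-desugared version of threeFlips: each generator becomes a flatMap,
// and the final yield becomes a map.
val threeFlipsDesugared: Pp[Boolean] =
  choose(Seq(true, false), Seq(0.75, 0.25)).flatMap { flip1 =>
    choose(Seq(true, false), Seq(0.75, 0.25)).flatMap { flip2 =>
      choose(Seq(true, false), Seq(0.75, 0.25)).map { flip3 =>
        flip1 && flip2 && flip3
      }
    }
  }
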

We can also define functions with nondeterministic choices. The flipMany function flips num coins and returns a list of the results:


In [17]:
def flipMany(num: Int): Pp[List[Boolean]] = {
    if (num == 0) {
        value(List())
    } else {
        for {
            flip <- choose(Seq(true, false), Seq(0.75, 0.25))
            rest <- flipMany(num - 1)
        } yield {
            flip :: rest
        }
    }
}

val flips = flipMany(1000)
val flipsDist = flips.beamSearch(10)


defined function flipMany
flips: Pp[List[Boolean]] = BindPp(CategoricalPp(List((true,-0.2876820724517809), (false,-1.3862943611198906))),<function1>)
flipsDist: Seq[(List[Boolean], Double)] = ArrayBuffer(
  (
    List(
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
      true,
...

Recursive functions such as flipMany are a natural way to express many structured prediction models. For example, a simple sequence tagger can be written with nearly the same structure, and the shift-reduce parser we define later follows the same recursive pattern.
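Here is a minimal sketch of such a tagger (not run here): it assigns one of a fixed set of tags to each token, using uniform weights for now; later we'll see how choices like these can be scored by a neural network.


In [ ]:
// A sequence tagger in the style of flipMany: nondeterministically assign
// one of the given tags to each token. Uniform tag probabilities are used
// here; the choices could instead be scored by a neural network.
def tagTokens(tokens: List[String], tags: Seq[String]): Pp[List[String]] = {
    if (tokens.isEmpty) {
        value(List())
    } else {
        for {
            tag <- choose(tags, Seq.fill(tags.size)(1.0 / tags.size))
            rest <- tagTokens(tokens.tail, tags)
        } yield {
            tag :: rest
        }
    }
}

// For example, tagTokens(List("the", "blue", "man"), Seq("DT", "JJ", "NN"))
// has type Pp[List[String]] and can be passed to beamSearch as before.
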

At this point, it is worth noting that inference does not explicitly enumerate all of the possible executions of the program, even though the for/yield syntax suggests this interpretation. The call to flipMany(1000) above has 2^1000 possible executions, which would make explicit enumeration computationally infeasible, yet inference completes quickly because beamSearch only approximately searches the space of executions.
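For example, the following sketch (not run here) would still complete quickly even with far more flips:


In [ ]:
// beamSearch searches the space of executions only approximately, so
// inference remains fast even for 10000 flips (2^10000 possible executions).
val manyFlipsDist = flipMany(10000).beamSearch(10)
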

The examples above demonstrate how to use Pp to express a distribution over values in terms of probabilistic choices and perform inference over the results. The monad also allows us to define computation graphs for neural networks. Here's a function implementing a multilayer perceptron:


In [6]:
import scala.collection.JavaConverters._
import org.allenai.pnp.{ Env, CompGraph, CompGraphNode }
import com.jayantkrish.jklol.tensor.{ Tensor, DenseTensor }
import com.jayantkrish.jklol.util.IndexedList
import com.google.common.base.Preconditions

def multilayerPerceptron(featureVector: Tensor): Pp[CompGraphNode] = for {
    params <- param("params")
    bias <- param("bias")
    hidden = ((params * featureVector) + bias).tanh
    params2 <- param("params2")
    bias2 <- param("bias2")
    dist = (params2 * hidden) + bias2
} yield {
    dist
}


import scala.collection.JavaConverters._
import org.allenai.pnp.{ Env, CompGraph, CompGraphNode }
import com.jayantkrish.jklol.tensor.{ Tensor, DenseTensor }
import com.jayantkrish.jklol.util.IndexedList
import com.google.common.base.Preconditions
defined function multilayerPerceptron

This function applies a two-layer neural network to an input feature vector. The feature vector is represented as a Tensor from the jklol library. Neural network parameters are retrieved by name using the param function and are instances of CompGraphNode. The function builds a neural network out of these nodes by applying standard operations such as matrix/vector multiplication, addition, and the hyperbolic tangent. These operations are overloaded to create new nodes in the computation graph.

Something that may seem odd is that the return type for this method is Pp[CompGraphNode], that is, a distribution over computation graph nodes. The reason for this type is that operations on CompGraphNode are stateful and manipulate an underlying computation graph. However, this graph is tracked by the Pp monad so it doesn't need to be explicitly referenced. This type also enables the computation graph to interact with nondeterministic choices, as we will see later.

Let's evaluate our neural network with a feature vector and some random parameters:


In [7]:
val features = new DenseTensor(
    // Dimension specification for the tensor
    Array[Int](2), Array[Int](2),
    // The tensor's values
    Array(1, 1))

// Evaluate the network with the feature
val nnPp = multilayerPerceptron(features)

// Randomly initialize the network parameters
val compGraph = new CompGraph(
    IndexedList.create(List("params", "bias", "params2", "bias2").asJava),
    Array(DenseTensor.random(Array(2, 1), Array(2, 8), 0.0, 1.0),
          DenseTensor.random(Array(1), Array(8), 0.0, 1.0),
          DenseTensor.random(Array(0, 1), Array(2, 8), 0.0, 1.0),
          DenseTensor.random(Array(0), Array(2), 0.0, 1.0)))

val dist = nnPp.beamSearch(10, Env.init, compGraph).executions
val nnOutput = dist(0).value.value


features: DenseTensor = [DenseTensor [0]=1.0, [1]=1.0, ]
nnPp: Pp[CompGraphNode] = BindPp(ParameterPp(params,-1),<function1>)
compGraph: CompGraph = org.allenai.pnp.CompGraph@494e13f5
dist: Seq[org.allenai.pnp.Execution[CompGraphNode]] = Array([Execution org.allenai.pnp.PlusNode@168c8d2d 0.0])
nnOutput: Tensor = [DenseTensor [0]=1.3619915424830156, [1]=3.2401367678101494, ]

To evaluate the network, we define values for the feature vector as well as the named network parameters. The parameters are declared by creating a computation graph containing them. This graph is then passed to the beamSearch method, which performs the forward pass of the neural network as part of inference. Inference returns a distribution with a single CompGraphNode that contains the network's output as its value. The output is a tensor with two values, which is about all we can say given the randomly initialized parameters.

Again, it may seem odd that beamSearch computes the network's forward pass in addition to performing probabilistic inference. This overloading is intentional, as it allows us to combine neural networks with nondeterministic choices to create probability distributions. Let's use our multilayer perceptron to define a probability distribution:


In [8]:
def booleanFunction(left: Boolean, right: Boolean): Pp[Boolean] = {
    // Build a feature vector from the inputs
    val values = Array.ofDim[Double](2)
    values(0) = if (left) { 1 } else { 0 }
    values(1) = if (right) { 1 } else { 0 }
    val featureVector = new DenseTensor(
        Array[Int](2), Array[Int](2), values)
  
    for {
        dist <- multilayerPerceptron(featureVector)
        output <- choose(Array(false, true), dist)
    } yield {
        output
    }
}

val output = booleanFunction(true, true)
val dist = output.beamSearch(10, Env.init, compGraph)
for (d <- dist.executions) {
    println(d.value + " " + (d.prob / dist.partitionFunction))
}


true 0.8673979379202605
false 0.1326020620797394
defined function booleanFunction
output: Pp[Boolean] = BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>)
dist: org.allenai.pnp.PpBeamMarginals[Boolean] = org.allenai.pnp.PpBeamMarginals@3b9aecd

This function computes a distribution over boolean outputs given two boolean inputs. It encodes the inputs as a feature vector, passes this vector to the multilayer perceptron, then uses the output of the perceptron to choose the output. Note that the input (true, true) is encoded as the same feature vector used in the previous code block, and that the network has the same parameters. If you look closely at the outputs of the two cells above, you'll notice that 0.867 = e^3.240 / (e^1.362 + e^3.240) and 0.133 = e^1.362 / (e^1.362 + e^3.240), where 1.362 and 3.240 are the two entries of nnOutput. That's because the values computed by the neural network are log-space weights for each execution: exponentiating them and dividing by their sum gives a probability distribution over executions.
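We can verify this numerically with a quick calculation using the two entries of nnOutput printed earlier:


In [ ]:
// Reproduce the printed probabilities by exponentiating the network's
// scores for (false, true) and normalizing.
val nnScores = Seq(1.3619915424830156, 3.2401367678101494)
val expScores = nnScores.map(math.exp)
val nnProbs = expScores.map(_ / expScores.sum)
// nnProbs is approximately List(0.1326, 0.8674), matching the printed
// distribution over false and true.
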

Next, let's train the network parameters to learn the xor function. First, let's create some training examples:


In [9]:
import org.allenai.pnp.PpExample

// Create training data.
val data = List(
  (true, true, false),
  (true, false, true),
  (false, true, true),
  (false, false, false)
)
val examples = data.map(x => {
  val unconditional = booleanFunction(x._1, x._2)
  val conditional = for {
    y <- unconditional;
    _ <- Pp.require(y == x._3)
  } yield {
    y
  }
  PpExample.fromDistributions(unconditional, conditional)
})


import org.allenai.pnp.PpExample
data: List[(Boolean, Boolean, Boolean)] = List(
  (true, true, false),
  (true, false, true),
  (false, true, true),
  (false, false, false)
)
examples: List[org.allenai.pnp.PpExample[Boolean]] = List(
  PpExample(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),BindPp(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),<function1>),org.allenai.pnp.Env@6fb8673a,<function1>),
  PpExample(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),BindPp(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),<function1>),org.allenai.pnp.Env@51d85bfd,<function1>),
  PpExample(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),BindPp(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),<function1>),org.allenai.pnp.Env@3c871636,<function1>),
  PpExample(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),BindPp(BindPp(BindPp(ParameterPp(params,-1),<function1>),<function1>),<function1>),org.allenai.pnp.Env@41b1423b,<function1>)
)

A training example consists of a pair of distributions: an unconditional distribution, generated by calling booleanFunction, and a conditional distribution, generated by constraining the unconditional distribution to have the correct label using require.
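To see what require does on its own, here is a small sketch (not run here) that keeps only the executions in which the flip comes up true and discards the rest:


In [ ]:
// require discards executions that violate its condition, so only the
// outcome true survives in this distribution.
val constrainedFlip = for {
    f <- choose(Seq(true, false), Seq(0.75, 0.25))
    _ <- Pp.require(f)
} yield {
    f
}


Using these examples, we can train the network: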


In [10]:
import com.jayantkrish.jklol.models.DiscreteVariable
import com.jayantkrish.jklol.models.VariableNumMap
import com.jayantkrish.jklol.training.{ StochasticGradientTrainer, Lbfgs }
import com.jayantkrish.jklol.training.NullLogFunction
import org.allenai.pnp.ParametricPpModel
import org.allenai.pnp.PpLoglikelihoodOracle

// Initialize neural net parameters and their dimensionalities.
// TODO(jayantk): This is more verbose than necessary
val v = DiscreteVariable.sequence("boolean", 2);
val h = DiscreteVariable.sequence("hidden", 8);
val inputVar = VariableNumMap.singleton(2, "input", v)
val hiddenVar = VariableNumMap.singleton(1, "hidden", h)
val outputVar = VariableNumMap.singleton(0, "output", v)
val paramNames = IndexedList.create(
   List("params", "bias", "params2", "bias2").asJava
)

val family = new ParametricPpModel(
  paramNames,
  List(inputVar.union(hiddenVar), hiddenVar,
    hiddenVar.union(outputVar), outputVar)
);

// Train model
val oracle = new PpLoglikelihoodOracle[Boolean](100, family)
val trainer = StochasticGradientTrainer.createWithL2Regularization(
  1000, 1, 1.0, false, false, 10.0, 0.0, new NullLogFunction()
)
val params = family.getNewSufficientStatistics
params.perturb(1.0)
val trainedParams = trainer.train(oracle, params, examples.asJava)
val model = family.getModelFromParameters(trainedParams)


import com.jayantkrish.jklol.models.DiscreteVariable
import com.jayantkrish.jklol.models.VariableNumMap
import com.jayantkrish.jklol.training.{ StochasticGradientTrainer, Lbfgs }
import com.jayantkrish.jklol.training.NullLogFunction
import org.allenai.pnp.ParametricPpModel
import org.allenai.pnp.PpLoglikelihoodOracle
v: com.jayantkrish.jklol.models.DiscreteVariable = boolean (2 values)
h: com.jayantkrish.jklol.models.DiscreteVariable = hidden (8 values)
inputVar: com.jayantkrish.jklol.models.VariableNumMap = [2:input=boolean (2 values),]
hiddenVar: com.jayantkrish.jklol.models.VariableNumMap = [1:hidden=hidden (8 values),]
outputVar: com.jayantkrish.jklol.models.VariableNumMap = [0:output=boolean (2 values),]
paramNames: IndexedList[String] = [params, bias, params2, bias2]
family: org.allenai.pnp.ParametricPpModel = org.allenai.pnp.ParametricPpModel@46633563
oracle: org.allenai.pnp.PpLoglikelihoodOracle[Boolean] = org.allenai.pnp.PpLoglikelihoodOracle@31b5d6ea
trainer: com.jayantkrish.jklol.training.StochasticGradientTrainer = com.jayantkrish.jklol.training.StochasticGradientTrainer@41091457
params: com.jayantkrish.jklol.models.parametric.SufficientStatistics = [TableFactor([1:hidden=hidden (8 values),2:input=boolean (2 values),])(16 weights), TableFactor([1:hidden=hidden (8 values),])(8 weights), TableFactor([0:output=boolean (2 values),1:hidden=hidden (8 values),])(16 weights), TableFactor([0:output=boolean (2 values),])(2 weights)]
trainedParams: com.jayantkrish.jklol.models.parametric.SufficientStatistics = [TableFactor([1:hidden=hidden (8 values),2:input=boolean (2 values),])(16 weights), TableFactor([1:hidden=hidden (8 values),])(8 weights), TableFactor([0:output=boolean (2 values),1:hidden=hidden (8 values),])(16 weights), TableFactor([0:output=boolean (2 values),])(2 weights)]
model: org.allenai.pnp.PpModel = org.allenai.pnp.PpModel@22bc6f1b

The first part of the above code declares the dimensionalities of each network parameter (a step whose syntax is currently more verbose than it needs to be), and the second part trains these parameters using 1000 iterations of stochastic gradient descent. Let's evaluate our function with the trained parameters:


In [11]:
val marginals = booleanFunction(false, true).beamSearch(
    100, Env.init, model.getInitialComputationGraph)

val dist = marginals.executions


marginals: org.allenai.pnp.PpBeamMarginals[Boolean] = org.allenai.pnp.PpBeamMarginals@3caff273
dist: Seq[org.allenai.pnp.Execution[Boolean]] = Array([Execution true 6.581252026716094], [Execution false -1.7194011072095412])

For the input (false, true), the predicted output is overwhelmingly likely to be true, as xor requires. You can change the arguments above to convince yourself that we have indeed learned the xor function.
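To check all four inputs at once, a loop like the following sketch (reusing model from the training cell above) prints the most probable output for each input pair:


In [ ]:
// Print the most probable output for each pair of boolean inputs,
// using the trained parameters.
for (left <- Seq(true, false); right <- Seq(true, false)) {
    val inputMarginals = booleanFunction(left, right).beamSearch(
        100, Env.init, model.getInitialComputationGraph)
    val best = inputMarginals.executions.maxBy(_.prob)
    println(s"($left, $right) -> " + best.value)
}
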

Of course, any neural network library can be used to learn xor. The power of the interaction between neural networks and nondeterministic choices only appears when we start doing structured prediction. Let's build a shift-reduce parser for a context-free grammar. First, let's create some data structures to represent parse trees:


In [12]:
abstract class Parse(val pos: String) {
  def getTokens: List[String]
}

case class Terminal(word: String, override val pos: String) extends Parse(pos) {
  override def getTokens: List[String] = {
    List(word)
  }

  override def toString: String = {
    pos + " -> " + word
  }
}

case class Nonterminal(left: Parse, right: Parse, override val pos: String) extends Parse(pos) {
  override def getTokens: List[String] = {
    left.getTokens ++ right.getTokens
  }

  override def toString: String = {
    pos + " -> (" + left.toString + ", " + right.toString + ")"
  }
}


defined class Parse
defined class Terminal
defined class Nonterminal

Let's also create some data structures to represent the parser's state and actions:


In [13]:
import scala.collection.mutable.ListBuffer
import scala.collection.mutable.MultiMap
import scala.collection.mutable.{ HashMap, Set }

// The state of a shift-reduce parser consists of a buffer
// of tokens that haven't been consumed yet and a stack
// of parse trees spanning the consumed tokens.  
case class ShiftReduceState(tokens: List[String], stack: List[Parse])

// Classes for representing actions of a shift/reduce
// parser.
abstract class Action {
  def apply(state: ShiftReduceState): ShiftReduceState
  
  // An action is responsible for generating the computation graph
  // that scores it.
  def score(state: ShiftReduceState): Pp[CompGraphNode]
  def addParams(names: IndexedList[String], vars: ListBuffer[VariableNumMap]): Unit
}

// The shift action consumes the first token on the buffer
// and pushes a parse tree on to the stack.
class Shift(val word: String, val pos: String) extends Action {
  val terminal = Terminal(word, pos)
  val paramName = "shift_" + word + "_" + pos

  override def apply(state: ShiftReduceState): ShiftReduceState = {
    ShiftReduceState(state.tokens.tail, (terminal :: state.stack))
  }

  override def score(state: ShiftReduceState): Pp[CompGraphNode] = {
    Pp.param(paramName)
  }

  override def addParams(names: IndexedList[String], vars: ListBuffer[VariableNumMap]): Unit = {
    names.add(paramName)
    vars += VariableNumMap.EMPTY
  }
}

// The reduce action combines the top two parses on the stack
// into a single parse.
class Reduce(val leftPos: String, val rightPos: String, val rootPos: String) extends Action {
  val paramName = "reduce_" + leftPos + "_" + rightPos + "_" + rootPos
  
  override def apply(state: ShiftReduceState): ShiftReduceState = {
    val left = state.stack(1)
    val right = state.stack(0)
    val nonterminal = Nonterminal(left, right, rootPos)
    ShiftReduceState(state.tokens, (nonterminal :: state.stack.drop(2)))
  }

  override def score(state: ShiftReduceState): Pp[CompGraphNode] = {
    Pp.param(paramName)
  }

  override def addParams(names: IndexedList[String], vars: ListBuffer[VariableNumMap]): Unit = {
    names.add(paramName)
    vars += VariableNumMap.EMPTY
  }
}


import scala.collection.mutable.ListBuffer
import scala.collection.mutable.MultiMap
import scala.collection.mutable.{ HashMap, Set }
defined class ShiftReduceState
defined class Action
defined class Shift
defined class Reduce

Using these data structures, we can define the parser as follows:


In [14]:
class PpParser(
    lexActions: MultiMap[String, Action],
    grammarActions: MultiMap[(String, String), Action]
) {

  def parse(sent: List[String]): Pp[Parse] = {
    parse(ShiftReduceState(sent, List()))
  }

  def parse(state: ShiftReduceState): Pp[Parse] = {
    val tokens = state.tokens
    val stack = state.stack
    if (tokens.size == 0 && stack.size == 1) {
      // All tokens consumed and all possible
      // reduces performed.
      value(stack.head)
    } else {
      // Queue for each possible action
      val actions = ListBuffer[Action]()

      // Queue shift actions
      if (tokens.size > 0) {
        val shifts = lexActions.getOrElse(tokens.head, Set())
        actions ++= shifts
      }

      // Queue reduce actions
      if (stack.size >= 2) {
        val left = stack(1)
        val right = stack(0)
        val reduces = grammarActions.getOrElse((left.pos, right.pos), Set())
        actions ++= reduces
      }

      for {
        // Score each possible action, nondeterministically
        // select one to apply, then recurse on the next
        // parser state.
        scores <- scoreActions(state, actions);
        action <- choose(actions.toArray, scores)
        p <- parse(action.apply(state))
      } yield {
        p
      }
    }
  }

  def scoreActions(state: ShiftReduceState, actions: ListBuffer[Action]): Pp[Array[CompGraphNode]] = {
    val scoreList = actions.foldRight(Pp.value(List[CompGraphNode]()))((action, list) =>
      for {
        x <- action.score(state);
        l <- list
      } yield {
        x :: l
      })

    scoreList.flatMap { x => Pp.value(x.toArray) }
  }

  // Generate the parameters used in the neural network
  // of this parser.
  def getParams: ParametricPpModel = {
    val paramNames = IndexedList.create[String]()
    val paramVars = ListBuffer[VariableNumMap]()

    lexActions.values.foreach(_.foreach(_.addParams(paramNames, paramVars)))
    grammarActions.values.foreach(_.foreach(_.addParams(paramNames, paramVars)))

    new ParametricPpModel(paramNames, paramVars.toList)
  }
}

object PpParser {
  // Initialize parser actions from maps of string -> part of speech
  // for shift actions and (pos, pos) -> pos for reduce actions
  def fromMaps(
    lexicon: List[(String, String)],
    grammar: List[((String, String), String)]
  ): PpParser = {

    val lexActions = new HashMap[String, Set[Action]] with MultiMap[String, Action]
    for ((k, v) <- lexicon) {
      lexActions.addBinding(k, new Shift(k, v))
    }

    val grammarActions = new HashMap[(String, String), Set[Action]] with MultiMap[(String, String), Action]
    for ((k, v) <- grammar) {
      grammarActions.addBinding(k, new Reduce(k._1, k._2, v))
    }

    new PpParser(lexActions, grammarActions)
  }
}


defined class PpParser
defined object PpParser

All said and done, we've built a globally-normalized shift-reduce parser with neural network factors in fewer than 200 lines of code, much of which is simple data structures. The scoring function for actions is deliberately simple, but it is easy to improve using the computation graph operations we saw in the xor example.
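For example, an action could be scored by combining a per-action weight vector with a feature vector computed from the parser state, much as the xor network scored its feature vector. Here is a rough sketch (not run here); the stateFeatures helper is hypothetical, and the parameter named by paramName would need matching dimensions declared in getParams:


In [ ]:
// Sketch of a richer action score: multiply a per-action weight vector
// (a network parameter) with features of the current parser state.
def featurizedScore(
    paramName: String,
    state: ShiftReduceState,
    stateFeatures: ShiftReduceState => Tensor): Pp[CompGraphNode] = {
  for {
    weights <- Pp.param(paramName)
  } yield {
    weights * stateFeatures(state)
  }
}


Let's use the parser to parse some sentences: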


In [15]:
// The set of shift actions
val lexicon = List(
  ("the", "DT"),
  ("the", "NN"),
  ("blue", "NN"),
  ("man", "NN")
)

// The set of reduce actions
val grammar = List(
  (("DT", "NN"), "NP"),
  (("NN", "NN"), "NN")
)

val parser = PpParser.fromMaps(lexicon, grammar)

// Get the neural network parameters needed to score
// parses in dist
val family = parser.getParams
val model = family.getModelFromParameters(
    family.getNewSufficientStatistics)
val cg = model.getInitialComputationGraph

// Run inference
val dist = parser.parse(List("the", "blue", "man"))
val marginals = dist.beamSearch(100, Env.init, cg)
val parses = marginals.executions


lexicon: List[(String, String)] = List(("the", "DT"), ("the", "NN"), ("blue", "NN"), ("man", "NN"))
grammar: List[((String, String), String)] = List((("DT", "NN"), "NP"), (("NN", "NN"), "NN"))
parser: PpParser = cmd13$$user$PpParser@73815acf
family: ParametricPpModel = org.allenai.pnp.ParametricPpModel@39b76121
model: org.allenai.pnp.PpModel = org.allenai.pnp.PpModel@2dfe7c9c
cg: CompGraph = org.allenai.pnp.CompGraph@df178
dist: Pp[Parse] = BindPp(BindPp(BindPp(ParameterPp(shift_the_DT,-1),<function1>),<function1>),<function1>)
marginals: org.allenai.pnp.PpBeamMarginals[Parse] = org.allenai.pnp.PpBeamMarginals@5aadd72a
parses: Seq[org.allenai.pnp.Execution[Parse]] = Array(
  [Execution NP -> (DT -> the, NN -> (NN -> blue, NN -> man)) 0.0],
  [Execution NN -> (NN -> the, NN -> (NN -> blue, NN -> man)) 0.0],
  [Execution NN -> (NN -> (NN -> the, NN -> blue), NN -> man) 0.0]
)

Parsing seems to work. We initialized the parameters to 0, so we get a uniform distribution over parse trees for the sentence. Now let's train the parser:


In [16]:
val trainingData = List[Parse](
    Nonterminal(Terminal("the", "DT"), Terminal("man", "NN"), "NP"),
    Nonterminal(Terminal("blue", "NN"), Terminal("man", "NN"), "NN")
    )

// Convert the trees to unconditional / conditional distributions
// that are used for training.
val examples = trainingData.map(tree => {
    val unconditional = parser.parse(tree.getTokens)
    val conditional = for {
        parse <- unconditional;
        _ <- Pp.require(parse == tree)
    } yield {
        parse
    }
    PpExample.fromDistributions(unconditional, conditional)
})

val oracle = new PpLoglikelihoodOracle[Parse](100, family)
val trainer = StochasticGradientTrainer.createWithL2Regularization(
  1000, 1, 1.0, false, false, 10.0, 0.0, new NullLogFunction())
val params = family.getNewSufficientStatistics
val trainedParams = trainer.train(oracle, params, examples.asJava)
val model = family.getModelFromParameters(trainedParams)

val dist = parser.parse(List("the", "blue", "man"))
val marginals = dist.beamSearch(100, Env.init, model.getInitialComputationGraph)
val parses = marginals.executions


trainingData: List[Parse] = List(NP -> (DT -> the, NN -> man), NN -> (NN -> blue, NN -> man))
examples: List[PpExample[Parse]] = List(
  PpExample(BindPp(BindPp(BindPp(ParameterPp(shift_the_DT,-1),<function1>),<function1>),<function1>),BindPp(BindPp(BindPp(BindPp(ParameterPp(shift_the_DT,-1),<function1>),<function1>),<function1>),<function1>),org.allenai.pnp.Env@21f8a7b2,<function1>),
  PpExample(BindPp(BindPp(BindPp(ParameterPp(shift_blue_NN,-1),<function1>),<function1>),<function1>),BindPp(BindPp(BindPp(BindPp(ParameterPp(shift_blue_NN,-1),<function1>),<function1>),<function1>),<function1>),org.allenai.pnp.Env@54f9b8b9,<function1>)
)
oracle: PpLoglikelihoodOracle[Parse] = org.allenai.pnp.PpLoglikelihoodOracle@120a34cf
trainer: StochasticGradientTrainer = com.jayantkrish.jklol.training.StochasticGradientTrainer@590acf05
params: com.jayantkrish.jklol.models.parametric.SufficientStatistics = [TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights)]
trainedParams: com.jayantkrish.jklol.models.parametric.SufficientStatistics = [TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights), TableFactor([])(1 weights)]
model: org.allenai.pnp.PpModel = org.allenai.pnp.PpModel@3dab26a1
dist: Pp[Parse] = BindPp(BindPp(BindPp(ParameterPp(shift_the_DT,-1),<function1>),<function1>),<function1>)
marginals: org.allenai.pnp.PpBeamMarginals[Parse] = org.allenai.pnp.PpBeamMarginals@248e7ddd
parses: Seq[org.allenai.pnp.Execution[Parse]] = Array(
  [Execution NP -> (DT -> the, NN -> (NN -> blue, NN -> man)) 1.9013592629203682],
  [Execution NN -> (NN -> (NN -> the, NN -> blue), NN -> man) -5.70407778876114],
  [Execution NN -> (NN -> the, NN -> (NN -> blue, NN -> man)) -5.70407778876114]
)

The code above is nearly identical to the code for the xor example: we generate unconditional and conditional distributions over parse trees, then maximize log likelihood with stochastic gradient descent. Parsing the same sentence now gives a distribution that is highly peaked on the tree whose shift and reduce actions match those in the training data.
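As in the xor example, we can normalize the execution weights into probabilities using the partition function:


In [ ]:
// Normalize the execution weights into a probability distribution over
// parse trees, as we did for the xor example.
for (p <- marginals.executions) {
    println(p.value + " " + (p.prob / marginals.partitionFunction))
}
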


The same machinery extends beyond parsing. As a final sketch, the cells below outline how a question answering model might be structured: nondeterministically choose a supporting sentence for the question, then choose an answer from that sentence, scoring the choices with neural networks. The retrieval and answer-selection helpers are left as unimplemented stubs, so these cells are a sketch rather than runnable code.


In [0]:
// Sketch: answer a question by nondeterministically choosing a supporting
// sentence, then nondeterministically choosing an answer from it.
def answerQ(question: String): Pp[String] = {
    for {
        sentence <- chooseSentence(question)
        answer <- chooseAnswer(sentence)
    } yield {
        answer
    }
}

// Retrieve candidate sentences, score each one with a neural network,
// then nondeterministically choose one.
def chooseSentence(question: String): Pp[String] = {
    val sentences = retrieveSentences(question)
    for {
        // Score every sentence with sentenceNn, folding the scores into
        // an array as scoreActions did for parser actions.
        scores <- scoreSentences(sentences)
        sent <- choose(sentences.toArray, scores)
    } yield {
        sent
    }
}

def sentenceNn(sentence: String): Pp[CompGraphNode] = {
    // Build a neural net to score the sentence
    ???
}

// These helpers are left unimplemented in this sketch.
def scoreSentences(sentences: List[String]): Pp[Array[CompGraphNode]] = ???
def retrieveSentences(question: String): List[String] = ???
def chooseAnswer(sentence: String): Pp[String] = ???


In [1]:
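// Pair each question's answer distribution with its correct answer
// (the question strings are elided).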
val data = List((answerQ("The thermometer ..."), "temperature"),
                (answerQ("What season occurs when ..."), "summer"))

